Application logs: Your eyes in production

Avi Thalker
Payoneer Engineering
8 min read · May 7, 2022

Production support: Hi John, sorry for interrupting you at 1:00 AM, but we’ve received complaints from several customers. It seems they are not able to receive their funds. Can you please check what the issue is?

John: Hmmm… give me a couple of minutes. I’ll check the logs and see if I can find something.

During development, the most common way to track your code and investigate bugs at runtime is the debugger provided by your IDE. With the debugger you can stop at any line of your code, inspect the current values of your variables, examine the stack trace, and understand your application’s behavior.

Now, after your application is deployed to production, how do you debug it? There are some remote debugging tools available, but these don’t always work for your organization due to security constraints. Enter application logs to the rescue!

Logs are very valuable for your application. With a well-defined structure, strict guidelines for how and when to write logs, and the right observability platform, you’ll be able to achieve the debugging level you need in order to understand your application’s behavior in production.

Log purpose and usage

  1. Troubleshooting: Logs allow you to understand what happened to the application in times of failure and unexpected results, so you need to be able to investigate through them.
  2. Tell the story: Logs should describe the entity’s journey through the application, and your business logic should be reflected by them. Logs should describe the flow being executed from a high-level point of view, and be suitable for people with no technical understanding of the application.
  3. Statistics and profiling: Although statistics can be measured and collected by dedicated packages and frameworks, it is also possible to derive them from the application logs. For example, you can see how many times a certain flow/API invocation/event has occurred, or how long a certain operation or flow took to execute (see the sketch after this list).
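For example, a flow can log its own duration so the log platform can aggregate it later. Here’s a minimal sketch using Log4j for Java (more on frameworks below; the class, flow, and field names are illustrative):

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class PayoutFlow {
    private static final Logger log = LogManager.getLogger(PayoutFlow.class);

    public void execute(String payoutId) {
        long start = System.currentTimeMillis();
        // ... business logic ...
        long elapsedMs = System.currentTimeMillis() - start;
        // A structured field (elapsedMs) lets the log platform aggregate durations.
        log.info("Payout flow completed, payoutId={}, elapsedMs={}", payoutId, elapsedMs);
    }
}
```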

Here are some guidelines and tips that will make your logs understandable, efficient, and one of your most powerful debugging tools in production.

Use a common logging framework

There are many popular open-source logging frameworks that you can use.
Based on your programming language, you can find the one that fits your needs, for example NLog for .NET, Log4j for Java, etc.

These frameworks are designed to make the logging process easy and optimized. They enable various features such as easy log formatting, defining log templates, defining different targets for the output of the logs, and more.

It’s definitely easier to use one of the popular frameworks instead of building your own.
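For example, here’s a minimal sketch of what using Log4j2 in Java looks like (the class name is illustrative); NLog offers a very similar experience for .NET:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class FundsTransferService {
    // One logger per class; the framework handles formatting, targets, and levels.
    private static final Logger log = LogManager.getLogger(FundsTransferService.class);

    public void transfer(String transferId) {
        // Parameterized messages avoid string concatenation when the level is disabled.
        log.info("Starting funds transfer, transferId={}", transferId);
    }
}
```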

Define uniform log templates

In order to make your log investigation process more efficient and easier, it’s important that your logs include relevant data and have a uniform structure. Most of the logging frameworks enable you to define a template for your logs and even automatically populate some of the fields for you.

Here are some of the data fields that should be included in your template (a short sketch of how to populate them follows the list):

  • Log level: info/error/warning
  • Message: the log text
  • Source: the name of the class that wrote the log
  • Timestamp
  • Exception message and stack trace
  • Hostname: the host name of the server that the application is running on
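Most frameworks can fill these fields for you. As a rough sketch with Log4j2: the timestamp, level, and source come from the framework; the hostname can be pushed into the ThreadContext; and the exception is captured by passing the Throwable to the log call. The pattern string and class names below are assumptions for illustration:

```java
import java.net.InetAddress;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.ThreadContext;

public class PayoutService {
    private static final Logger log = LogManager.getLogger(PayoutService.class);

    public static void main(String[] args) throws Exception {
        // Hostname isn't captured by default; putting it in the ThreadContext lets a
        // pattern such as "%d{ISO8601} %-5level %logger [%X{hostname}] %msg%n%throwable"
        // (configured in log4j2.xml) render it on every line.
        ThreadContext.put("hostname", InetAddress.getLocalHost().getHostName());

        try {
            throw new IllegalStateException("funds transfer rejected");
        } catch (IllegalStateException e) {
            // Passing the exception as the last argument logs both its message
            // and the full stack trace alongside the human-readable text.
            log.error("Failed to release funds for the payout", e);
        }
    }
}
```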

Use the right log level

Multiple log levels

It’s important to categorize your logs by their severity level. When viewing the logs of an application, you want to know as quickly as possible whether everything is working as expected. Dividing the application logs into severity levels can give you quick insight into whether something is wrong in a specific timeframe, even without reading the logs’ content.

The suggested log levels are (a usage sketch follows the list):

  • Debug: log at this level everything that happens in the application that will help you during the debugging process. For example, beginning/ending of crucial functions, raw data that’s returned from the DB, logs inside a conditional scope, etc.
  • Info: at this level, log all actions that are: user driven, business driven or system-specific driven (for example scheduled operations). Logs at this level should tell the story behind the flow being executed.
  • Warning: at this level, log all events that can potentially become an error. For example, retry attempt before failure or long execution duration of an API call. Warning logs are here to indicate a potential error that might occur if the issue isn’t solved.
  • Error: at this level, log all events that occur during a failure flow. For example: API call failure, database access error, or failure of your business/user/system flow. Remember, logs at this level need to be monitored, as they indicate that something bad has happened. If your application constantly produces logs at this level, then you probably have a bug or you chose the wrong level for your log.
  • Critical: this level is rarely used and, when used, should be used carefully. At this level, log unrecoverable errors that require external intervention to fix. For example, application failures on startup.
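Mapping these levels to code, here’s a minimal sketch (the flow and all names are illustrative; Log4j’s closest match to Critical is the fatal level):

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class WithdrawalFlow {
    private static final Logger log = LogManager.getLogger(WithdrawalFlow.class);

    public void withdraw(String userId, int attempt) {
        log.debug("Entering withdraw, userId={}, attempt={}", userId, attempt); // development detail

        log.info("Withdrawal requested, userId={}", userId); // business story

        if (attempt > 1) {
            log.warn("Retrying withdrawal, userId={}, attempt={}", userId, attempt); // potential error
        }

        try {
            // ... call the payment provider ...
        } catch (RuntimeException e) {
            log.error("Withdrawal failed, userId={}", userId, e); // failure flow
        }
        // log.fatal(...) is reserved for unrecoverable errors,
        // such as failures on application startup.
    }
}
```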

Write clear, understandable and meaningful logs

  • Make sure to write meaningful and clear logs, with correct grammar. The log is an integral part of the development process, and incorrect logs can quickly lead to wrong conclusions and bugs.
  • Always include unique identifiers (UserId, EntityId, etc.) in your log text in order to relate it to a specific business flow. Logs that cannot be related to a flow are useless; without proper context, they are just noise. During an investigation, you need to know which entity the log was written for, and the unique identifier will indicate that for you.
  • For error logs (if they were written due to an exception), make sure you include both human-readable text that describes the error and the exception’s technical message and stack trace.
  • Don’t be context driven. Sometimes a log is written from a deep internal component of the service, and it’s phrased in a way that makes sense only in the specific context it lives in. Unfortunately, when reading the log as part of a flow investigation, this context is absent, and the log might not be understandable.

The higher you are in the hierarchy of the service, the more your log will describe the user story and be business driven.

The lower you are in the hierarchy of the service, the more your log will describe a technical operation and be development driven.

It’s reasonable that most of the info-level logs will be written from the high-level components of your service and most of the debug-level logs will be written from the low-level/inner components.
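As a sketch of this idea (class names are illustrative):

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// High-level component: logs tell the business story, at info level.
class RegistrationFlow {
    private static final Logger log = LogManager.getLogger(RegistrationFlow.class);

    void register(String userId) {
        log.info("Registration flow started, userId={}", userId);
    }
}

// Low-level component: logs describe technical operations, at debug level.
class UserRepository {
    private static final Logger log = LogManager.getLogger(UserRepository.class);

    void save(String userId) {
        log.debug("Inserting row into Users table, userId={}", userId);
    }
}
```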

Do I need to log this?

Let’s identify some common cases and understand how to log them (a sketch combining several of these cases follows the list):

  • Log (Info) every beginning and ending of a business flow with the relevant related data that might be interesting to view when reading the log.
  • Log (Info) every update to an entity that’s done during the business flow. The update log should be written from a high-level component if possible; avoid doing it from a low-level class, since it’s not business driven.
  • Log (Info) every invocation of a third-party service. Include relevant data about the entity you’re making the call for. Request object data might be logged too.
  • Log (Info) every event being published from the service. Include relevant data about the entity you’re publishing the event for.
  • Log (Warning) every retry attempt.
  • Log (Info/Warning/Error) every operation that was blocked due to validation checks of the service.
  • Log (Error) every invocation failure of a third-party service. Include the error message from the third party and the status code (for HTTP calls).
  • Log (Error) every failure of event publishing.
  • Log (Error) every failure of the business flow. This log should be written from the high-level components of your service.
  • Log (Info / Warning / Error) each exception that’s caught and handled.
  • Log (Debug) technical operations such as fetching data from the DB, object builder operations, success of a technical/inner operation, etc.
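Here’s a sketch that combines several of these cases in one flow (the service, API, and field names are assumptions for illustration):

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class PayoutProcessor {
    private static final Logger log = LogManager.getLogger(PayoutProcessor.class);
    private static final int MAX_ATTEMPTS = 3;

    public void process(String payoutId) {
        log.info("Payout flow started, payoutId={}", payoutId); // beginning of business flow
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                log.info("Calling bank transfer API, payoutId={}", payoutId); // third-party invocation
                callBankApi(payoutId);
                log.info("Payout flow completed, payoutId={}", payoutId); // end of business flow
                return;
            } catch (RuntimeException e) {
                if (attempt < MAX_ATTEMPTS) {
                    log.warn("Bank transfer API failed, retrying, payoutId={}, attempt={}",
                            payoutId, attempt); // retry attempt
                } else {
                    log.error("Payout flow failed, payoutId={}", payoutId, e); // failure of business flow
                }
            }
        }
    }

    private void callBankApi(String payoutId) {
        // ... HTTP call to the third-party provider ...
    }
}
```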

Don’t log too much or too little

Too many logs can cause a lot of noise, making it hard to troubleshoot when needed and damaging your ability to extract insights.

It also has a performance cost: writing a log is an I/O operation, which becomes expensive at high scale and load.

On the other hand, too few logs will make the troubleshooting process very difficult or even impossible, since you don’t have the right logs in order to understand the whole flow.

Adjusting your logs to an optimal level is a process that can take time. If you’re not sure what’s needed and what isn’t, just start logging. Follow your logs for a period of time, live with your service, and you will get a deeper understanding of whether each log is necessary. If not, remove it in your next version.

Don’t log sensitive data

  • Avoid logging sensitive data like passwords, credit card numbers, social security numbers (SSNs), etc. (a masking sketch follows this list).
  • Avoid logging Personally Identifiable Information (PII) such as first names, last names, phone numbers, addresses, etc.
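If a value is needed in the log for context but is sensitive, a common approach is to mask it before logging. A minimal sketch follows (the masking rule here is an assumption; align it with your compliance requirements):

```java
public final class LogMasking {
    private LogMasking() {}

    // Keep only the last four digits of a card number,
    // e.g. maskCardNumber("4111 1111 1111 1234") -> "**** **** **** 1234".
    public static String maskCardNumber(String cardNumber) {
        String digits = cardNumber.replaceAll("\\D", "");
        if (digits.length() < 4) {
            return "****";
        }
        return "**** **** **** " + digits.substring(digits.length() - 4);
    }
}
```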

Logs in multi-instance architecture

Investigating the logs of a multi-instance system can be a little tricky. Your service has multiple instances that work concurrently. A request can be initiated on one instance and, at some point, the continuation of the execution can move to a different instance. How can you follow the logs of your flow when each instance of your service has its own log file? Your logs are distributed too: you have to jump between multiple files and connect the dots in order to understand which logs are related to each other.

How can we make this easier?

  • Use a platform that can present multiple log files in one unified view. This way, you’ll be able to read and search your logs as if you were handling a single log file. Coralogix is one of the platforms that provides this ability.

  • Have a correlation ID in each log. The correlation ID should be an identifier that helps you easily find all the logs related to a certain flow. This way, even if related logs are written from different instances, they will all share the same correlation ID (see the sketch after this list). For example, let’s say a service handles a registration request, and this request contains a RequestReferenceId (unique per request). By making the RequestReferenceId the correlation ID, you can search logs by the RequestReferenceId and view the whole flow of the request through your logs.
  • Make sure to add the name of the machine to your log template. It’s important to know which machine logged the data. Multi-instance architecture can also have bugs that relate to a specific instance while the others are working properly. Having the machine name as part of your log template will provide you valuable information when investigating issues.
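Here’s a sketch of the correlation ID technique using Log4j2’s ThreadContext (the handler and the context key are assumptions): push the identifier into the context at the entry point of the flow, and every log written during that flow will carry it, so your log platform can filter on it across instances.

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.ThreadContext;

public class RegistrationHandler {
    private static final Logger log = LogManager.getLogger(RegistrationHandler.class);

    public void handle(String requestReferenceId) {
        // Make the RequestReferenceId the correlation ID for this flow. With
        // "%X{correlationId}" in the log pattern, every log written on this
        // thread will carry it, no matter which instance handles the request.
        ThreadContext.put("correlationId", requestReferenceId);
        try {
            log.info("Registration flow started");
            // ... business logic ...
            log.info("Registration flow completed");
        } finally {
            ThreadContext.remove("correlationId");
        }
    }
}
```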

Reviewing the logs of a new feature must be part of the developer’s sanity checks, and also part of the QA testing flow. Unclear logs, missing logs, or too many logs are considered bugs and should be treated like any other bug.

To conclude

Logs tell the story behind the application. By reading the logs you should be able to understand the business flow that was executed on a specific entity.

Logs are your major tool for understanding and troubleshooting your service workflow in production. You should invest extra time when developing and testing to make sure you’re getting the expected result.

See you in production!
